Use page tracking for snapshot and restore #683


Open · simongdavies wants to merge 6 commits into main from update-snapshot-and-restore

Conversation

@simongdavies simongdavies (Contributor) commented Jul 1, 2025

This PR makes it so sandbox Snapshots no longer contain a full copy of memory at the time of snapshot. Instead, only pages that have been dirtied since the last snapshot are included. This improves performance due to less memory copying when restoring a snapshot, and also (usually) reduces memory consumption due to not having to keep full copies around.

The PR first introduces dirty page tracking. This is done at two levels:

  1. In the host, manually, when writing to shared memory.
  2. In the VM, by the hypervisor.

The union of these two sets determines which memory has been dirtied since the last snapshot.
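
As a rough illustration, a minimal sketch of that union (a hypothetical helper; the PR stores dirty pages as Vec<u64> bitmaps with one bit per page, but the actual merging code may differ):

    /// Union of the host-tracked and hypervisor-tracked dirty bitmaps.
    /// Hypothetical helper, not the PR's actual code.
    fn union_dirty_bitmaps(host_dirty: &[u64], vm_dirty: &[u64]) -> Vec<u64> {
        // Both bitmaps cover the same sandbox memory, one bit per page.
        debug_assert_eq!(host_dirty.len(), vm_dirty.len());
        host_dirty
            .iter()
            .zip(vm_dirty.iter())
            .map(|(h, v)| h | v) // a page is dirty if either side marked it
            .collect()
    }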

Since snapshots no longer contain a full copy of memory, restoring memory from a snapshot is a bit more complicated.
The algorithm this PR uses goes something like this:

Function: Restore memory to snapshot N, given that the sandbox was most recently at snapshot M (N may be <, >, or == M)

Step 1: Determine pages to restore
	Start with all currently dirty pages
	If N == M: Only restore currently dirty pages
	If N < M (rollback): Add all pages that exist in snapshots M, M-1, ..., N+1 (walking backwards from M to N, exclusive of N)
	If N > M (fast-forward): Add all pages that exist in snapshots N, N-1, ..., M+1 (walking backwards from N to M, exclusive of M)
	If no previous snapshot (M = None): Add all pages that exist in N and all its ancestors

Step 2: Restore each page
	For each page p in the combined set:
		Search for page p starting from snapshot N and walking backwards through its ancestors (N, N-1, N-2, ...)
		If found in any snapshot: Restore from that snapshot's data
		If not found in any snapshot: Zero the page (return to initial state)
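
A minimal sketch of this algorithm for a linear snapshot chain (the Snapshot struct, slice-backed memory, and all names here are illustrative assumptions, not the PR's actual types; each stored page is assumed to be exactly PAGE_SIZE bytes):

    use std::collections::{BTreeSet, HashMap};

    const PAGE_SIZE: usize = 4096;

    // One snapshot: only the pages that were dirty when it was taken.
    struct Snapshot {
        pages: HashMap<usize, Vec<u8>>, // page index -> PAGE_SIZE bytes
    }

    // Restore `mem` to snapshot `n`, given that the sandbox most recently
    // reflected snapshot `m` (None = brand new sandbox); snaps[0] is oldest.
    fn restore_to(
        mem: &mut [u8],
        snaps: &[Snapshot],
        n: usize,
        m: Option<usize>,
        currently_dirty: &BTreeSet<usize>,
    ) {
        // Step 1: determine pages to restore, starting with currently dirty ones.
        let mut pages: BTreeSet<usize> = currently_dirty.clone();
        match m {
            Some(m) if n < m => {
                // Rollback: pages recorded in snapshots M, M-1, ..., N+1.
                for s in &snaps[n + 1..=m] {
                    pages.extend(s.pages.keys().copied());
                }
            }
            Some(m) if n > m => {
                // Fast-forward: pages recorded in snapshots N, N-1, ..., M+1.
                for s in &snaps[m + 1..=n] {
                    pages.extend(s.pages.keys().copied());
                }
            }
            Some(_) => {} // n == m: only the currently dirty pages.
            None => {
                // No previous snapshot: all pages in N and its ancestors.
                for s in &snaps[..=n] {
                    pages.extend(s.pages.keys().copied());
                }
            }
        }

        // Step 2: restore each page from the most recent snapshot <= N that
        // recorded it; if none did, return the page to its initial zeroed state.
        for p in pages {
            let slot = &mut mem[p * PAGE_SIZE..(p + 1) * PAGE_SIZE];
            match snaps[..=n].iter().rev().find_map(|s| s.pages.get(&p)) {
                Some(data) => slot.copy_from_slice(data),
                None => slot.fill(0),
            }
        }
    }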
MSHV3 microbenchmarks vs main branch (on a VM)
azureuser@aure-linux3-dev-vm [ ~/hyperlight ]$ cargo bench --profile release -- --load-baseline new_pr_mshv --save-baseline new_main_mshv
    Finished `release` profile [optimized] target(s) in 0.16s
     Running benches/benchmarks.rs (target/release/deps/benchmarks-db619ff9b478f950)
Gnuplot not found, using plotters backend
guest_functions/guest_call
                        time:   [67.542 µs 68.089 µs 68.526 µs]
                        change: [−1.2082% +0.1264% +1.4700%] (p = 0.86 > 0.05)
                        No change in performance detected.
guest_functions/guest_restore
                        time:   [20.162 µs 20.293 µs 20.439 µs]
                        change: [+0.1955% +2.1241% +4.0820%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
guest_functions/guest_call_with_restore
                        time:   [157.94 µs 158.43 µs 158.92 µs]
                        change: [+56.387% +58.390% +59.707%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
guest_functions/guest_call_with_call_to_host_function
                        time:   [310.00 µs 311.67 µs 313.40 µs]
                        change: [−3.4428% −2.8296% −2.2254%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sandboxes/create_uninitialized_sandbox
                        time:   [386.31 µs 387.47 µs 388.68 µs]
                        change: [+7.0542% +7.5942% +8.2346%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
sandboxes/create_uninitialized_sandbox_and_drop
                        time:   [448.14 µs 448.95 µs 449.91 µs]
                        change: [+6.6067% +6.8878% +7.1789%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe
sandboxes/create_sandbox
                        time:   [1.3778 ms 1.3898 ms 1.4019 ms]
                        change: [+4.1825% +5.9375% +7.7199%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
sandboxes/create_sandbox_and_drop
                        time:   [33.784 ms 34.339 ms 34.926 ms]
                        change: [−1.6906% +0.5839% +3.0143%] (p = 0.63 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

sandbox_heap_sizes/create_sandbox_default_heap
                        time:   [1.4358 ms 1.4487 ms 1.4623 ms]
                        change: [+6.5125% +7.9527% +9.3471%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
sandbox_heap_sizes/create_sandbox_50mb_heap
                        time:   [32.056 ms 32.142 ms 32.238 ms]
                        change: [+4.5466% +4.9531% +5.3459%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
sandbox_heap_sizes/create_sandbox_500mb_heap
                        time:   [299.92 ms 300.96 ms 302.29 ms]
                        change: [+1.9729% +2.4132% +2.9360%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
sandbox_heap_sizes/create_sandbox_995mb_heap
                        time:   [590.06 ms 590.81 ms 591.64 ms]
                        change: [+1.0550% +1.3629% +1.5978%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

2_large_parameters/guest_call_restore_5mb_params
                        time:   [52.219 ms 52.306 ms 52.396 ms]
                        change: [−31.848% −30.444% −29.144%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
2_large_parameters/guest_call_restore_20mb_params
                        time:   [211.97 ms 212.31 ms 212.71 ms]
                        change: [−13.969% −11.336% −8.5797%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
2_large_parameters/guest_call_restore_60mb_params
                        time:   [601.85 ms 605.99 ms 610.98 ms]
                        change: [+24.901% +26.688% +28.462%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high severe

guest_call_restore_heap_sizes/guest_call_restore_default_heap
                        time:   [160.57 µs 162.40 µs 164.57 µs]
                        change: [+70.745% +72.839% +75.092%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
guest_call_restore_heap_sizes/guest_call_restore_50mb_heap
                        time:   [417.92 µs 419.58 µs 421.69 µs]
                        change: [−80.442% −80.263% −80.071%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
guest_call_restore_heap_sizes/guest_call_restore_500mb_heap
                        time:   [2.7581 ms 2.7617 ms 2.7662 ms]
                        change: [−87.626% −87.576% −87.528%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
guest_call_restore_heap_sizes/guest_call_restore_995mb_heap
                        time:   [5.2304 ms 5.2474 ms 5.2754 ms]
                        change: [−90.122% −89.824% −89.518%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

snapshot/default_heap   time:   [22.415 µs 22.614 µs 22.880 µs]
                        change: [+8.7437% +10.398% +12.225%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
snapshot/50_mb_heap     time:   [273.43 µs 273.93 µs 274.57 µs]
                        change: [−98.709% −98.704% −98.699%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) high mild
  17 (17.00%) high severe
snapshot/500_mb_heap    time:   [2.5732 ms 2.5797 ms 2.5891 ms]
                        change: [−98.676% −98.664% −98.653%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
snapshot/995_mb_heap    time:   [5.0826 ms 5.0971 ms 5.1208 ms]
                        change: [−99.520% −99.512% −99.503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
KVM microbenchmarks vs main branch (on WSL)
ludde@ludvig-thinkpad:~/hyperlight-new$ cargo bench --profile release -- --load-baseline new_pr --baseline new_main
    Finished `release` profile [optimized] target(s) in 0.54s
     Running benches/benchmarks.rs (target/release/deps/benchmarks-253a48cada671a19)
Gnuplot not found, using plotters backend
guest_functions/guest_call
                        time:   [13.908 µs 14.046 µs 14.200 µs]
                        change: [−23.042% −19.582% −15.931%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
guest_functions/guest_restore
                        time:   [9.6810 µs 9.8001 µs 9.9421 µs]
                        change: [−42.284% −40.142% −38.049%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
guest_functions/guest_call_with_restore
                        time:   [69.005 µs 70.373 µs 72.045 µs]
                        change: [+44.369% +58.109% +72.305%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe
guest_functions/guest_call_with_call_to_host_function
                        time:   [377.03 µs 382.01 µs 388.41 µs]
                        change: [−11.386% −8.8335% −6.2630%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

sandboxes/create_uninitialized_sandbox
                        time:   [226.99 µs 231.31 µs 236.45 µs]
                        change: [−3.1739% −0.3462% +2.5131%] (p = 0.81 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
sandboxes/create_uninitialized_sandbox_and_drop
                        time:   [227.20 µs 230.25 µs 233.84 µs]
                        change: [+5.6581% +8.5039% +11.149%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
sandboxes/create_sandbox
                        time:   [1.0948 ms 1.1111 ms 1.1297 ms]
                        change: [−7.9543% −3.7963% +0.8833%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
sandboxes/create_sandbox_and_drop
                        time:   [1.9554 ms 1.9835 ms 2.0152 ms]
                        change: [−4.1234% −1.1564% +1.6366%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

sandbox_heap_sizes/create_sandbox_default_heap
                        time:   [1.2194 ms 1.2454 ms 1.2765 ms]
                        change: [+9.0826% +14.088% +19.624%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
sandbox_heap_sizes/create_sandbox_50mb_heap
                        time:   [1.5619 ms 1.5915 ms 1.6229 ms]
                        change: [+2.9032% +5.6611% +8.6451%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
sandbox_heap_sizes/create_sandbox_500mb_heap
                        time:   [2.6730 ms 2.7462 ms 2.8405 ms]
                        change: [+12.363% +16.198% +20.804%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
sandbox_heap_sizes/create_sandbox_995mb_heap
                        time:   [3.7462 ms 3.8141 ms 3.8888 ms]
                        change: [+4.9046% +7.3384% +9.9302%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

2_large_parameters/guest_call_restore_5mb_params
                        time:   [91.571 ms 92.535 ms 93.557 ms]
                        change: [−51.474% −50.725% −49.930%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
2_large_parameters/guest_call_restore_20mb_params
                        time:   [356.37 ms 360.05 ms 363.86 ms]
                        change: [−11.216% −9.6227% −8.0211%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
2_large_parameters/guest_call_restore_60mb_params
                        time:   [1.0638 s 1.0767 s 1.0901 s]
                        change: [+13.741% +15.471% +17.179%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

guest_call_restore_heap_sizes/guest_call_restore_default_heap
                        time:   [79.753 µs 81.480 µs 83.377 µs]
                        change: [+78.458% +86.354% +93.528%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
guest_call_restore_heap_sizes/guest_call_restore_50mb_heap
                        time:   [86.856 µs 88.864 µs 91.050 µs]
                        change: [−98.410% −98.324% −98.241%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
guest_call_restore_heap_sizes/guest_call_restore_500mb_heap
                        time:   [99.096 µs 101.65 µs 104.43 µs]
                        change: [−99.818% −99.813% −99.808%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
guest_call_restore_heap_sizes/guest_call_restore_995mb_heap
                        time:   [112.63 µs 115.11 µs 118.16 µs]
                        change: [−99.907% −99.904% −99.900%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

snapshot/default_heap   time:   [10.766 µs 10.890 µs 11.032 µs]
                        change: [−53.139% −51.740% −50.439%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe
snapshot/50_mb_heap     time:   [12.270 µs 12.441 µs 12.641 µs]
                        change: [−99.866% −99.863% −99.860%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
snapshot/500_mb_heap    time:   [21.229 µs 21.496 µs 21.808 µs]
                        change: [−99.978% −99.977% −99.976%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe
snapshot/995_mb_heap    time:   [31.801 µs 32.432 µs 33.173 µs]
                        change: [−99.987% −99.987% −99.986%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
OLD PR Description

This pull request introduces snapshotting and restoring of sandbox state using dirty page tracking rather than copying the entire memory state each time. In many/most scenarios where sandboxes have larger-than-default amounts of memory, this results in better performance and reduced memory usage.

However, due to inefficiencies in the way dirty page tracking works on mshv, and the fact that tracking has not yet been implemented for Windows, this is not universally true.

Included below are screenshots showing the difference in time taken for some benchmarks with and without dirty page tracking enabled, along with some explanations of the differences seen.

The changes comprise:

Host Shared Memory Dirty Page tracking

  • Added a custom signal handler for SIGSEGV to support dirty page tracking for host memory mapped into a VM. Updated documentation to reflect this change and added debugging instructions for handling SIGSEGV in GDB and LLDB.

VM Dirty page tracking

  • Enabled dirty page tracking for mshv and KVM drivers to track changes made to memory in the guest

Snapshot Management

  • Added a new snapshot manager module to create, manage and restore memory snapshots.

Benchmarking Enhancements

  • Refactored the guest_call_benchmark_large_param function to support benchmarks for multiple parameter sizes and added a new sandbox_heap_size_benchmark function to measure sandbox creation performance with varying heap sizes.
  • Introduced guest_call_heap_size_benchmark to evaluate guest function call performance with different heap sizes.

Performance Changes

[Screenshots: KVM Before (kvm-before) and KVM After (kvm-after)]

KVM shows large performance improvements when creating large sandboxes and when calling functions in sandboxes with large memory configurations; although not measured, the amount of memory consumed should also have been reduced considerably. In the scenario where large parameters are passed to the sandbox there is a performance degradation. This has not been investigated yet; it may be due to the page-based mechanism for saving/restoring data being more expensive than copying and restoring all the data. Other work to make large parameter passing more efficient will likely have a large positive impact here as well.

There is some performance regression for small/default sandbox sizes, which is more than likely caused by the overhead of tracking and of building/restoring page-based snapshots outweighing, at small memory sizes, the cost of simply copying and restoring the entire memory.

[Screenshots: mshv2 Before/After (mshv2-before, mshv2-after) and mshv3 Before/After (mshv3-before, mshv-after)]

Both mshv2 and mshv3 show patterns similar to KVM: the larger sandbox sizes show large improvements, the default/small sandboxes show regressions, and the large parameter sizes show regressions, possibly for the same reason as on KVM; again, no investigation has been done here yet.

The biggest difference between KVM and mshv comes down to two things. First, when enabling dirty page tracking on mshv, the first call to get dirty pages returns a bitmap showing all pages dirty; since this would cause us to snapshot all memory as a baseline after enabling tracking, this PR gets the dirty pages immediately and then discards the result. This approach means we have to make two calls to get dirty pages when we really only need one. Second, the impact of that extra call is quite large, especially with larger memory configurations, since the call to get dirty pages seems to be O(n), where n is the number of pages in the memory configuration (regardless of whether the pages are dirty or not); for a 950mb VM I observed ~1.4ms per call. Even so, with larger sandboxes this approach is still much quicker than copying all the memory. Fixing these issues in mshv will probably bring the performance much closer to KVM.
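
A hedged sketch of that workaround, mirroring the get_dirty_log call that appears in the review diff further down (the function name, error handling, and imports here are assumptions based on the mshv3 bindings):

    use mshv_bindings::MSHV_GPAP_ACCESS_OP_CLEAR;
    use mshv_ioctls::VmFd;

    // Call once, immediately after enabling dirty page tracking: the first
    // fetch on mshv reports every page as dirty, so discard that baseline
    // bitmap rather than snapshotting all of memory.
    fn discard_baseline_dirty_log(vm_fd: &VmFd, base_pfn: u64, total_size: usize) {
        let _all_dirty = vm_fd
            .get_dirty_log(base_pfn, total_size, MSHV_GPAP_ACCESS_OP_CLEAR as u8)
            .expect("get_dirty_log failed");
        // Subsequent calls report only pages dirtied after this point.
    }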

[Screenshots: Windows 2025 Before (windows-before) and After (windows-after)]

On Windows, performance is largely either the same or has regressed. This is because the Windows implementation has not been done yet: at the moment, each time dirty pages are requested, Windows reports that all pages are dirty, and snapshots/restores are done on that basis. There is some overhead to this approach, especially when restoring.

@simongdavies simongdavies added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label Jul 1, 2025
@simongdavies simongdavies linked an issue Jul 1, 2025 that may be closed by this pull request
@simongdavies simongdavies force-pushed the update-snapshot-and-restore branch 5 times, most recently from 7e85120 to 014eab3 on July 3, 2025 15:48
@simongdavies simongdavies changed the title WIP: Use page tracking for snapshot and restore Use page tracking for snapshot and restore Jul 3, 2025
@simongdavies simongdavies force-pushed the update-snapshot-and-restore branch from 014eab3 to 56fd983 on July 3, 2025 16:43
ludfjig: This comment was marked as outdated.

@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch 16 times, most recently from 04995b0 to 894f5fe on July 14, 2025 23:48
@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch from 0808c97 to ed8a9d2 on July 17, 2025 05:04
@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch 5 times, most recently from 456d533 to 58418bb on July 17, 2025 20:40

@jprendes jprendes (Contributor) left a comment

I haven't finished reviewing it yet, but these are my comments so far.

Comment on lines +406 to +412

    if base_pfn == u64::MAX {
        base_pfn = mshv_region.guest_pfn;
    }

Contributor:

This is effectively obtaining the guest_pfn of the first region.
Is there a guarantee on the order of the regions?
Why does this correspond to the guest page number?

Should this maybe be the smallest guest_pfn? (although I don't understand why 😅)

    base_pfn = base_pfn.min(mshv_region.guest_pfn);

@ludfjig (Contributor), Jul 17, 2025:

No, there is no guarantee of the order currently, but it happens to be true that they are in order and contiguous because of the way we build them. It's probably something we want to enforce at some point.

Contributor Author:

Don't we use a constant elsewhere to specify the base address that we map to? In which case, can we not compute the PFN based on that constant value (I think it may actually be zero)? The point is that if we base it on the constant and that constant ever changes or gets removed, this code will be impacted.

@ludfjig (Contributor), Jul 22, 2025:

The first guest region's guest address is 0 (BASE_ADDRESS). The page frame number is obtained by right-shifting the guest address by 12 (note that 1 << 12 == page size). I think it is better to calculate it directly like this, rather than assume that it is derived from BASE_ADDRESS, which may or may not be the case.
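
That is, something along the lines of this one-line helper (illustrative, assuming 4 KiB pages):

    /// Guest page frame number for a guest physical address (4 KiB pages).
    fn guest_pfn(guest_address: u64) -> u64 {
        guest_address >> 12 // 1 << 12 == 4096 == page size
    }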

@jprendes jprendes (Contributor) left a comment

Some more comments.

@jprendes jprendes (Contributor) left a comment

Looks great! Thanks for working on this!

I left some more comments, but in general it's great.

}
}
};
for page_idx in bit_index_iterator(&bitmap) {

Contributor:

I think it would be nice to use a dependency like bitvec.
Then we can do something like this:

    let mut sandbox_dirty_pages = BitVec::new();
    for mem_region in self.mem_regions.iter().take(n_contiguous) {
        let num_pages = mem_region.guest_region.len() / PAGE_SIZE_USIZE;
        let bitmap = ...;
        let mut bitmap = BitVec::from_vec(bitmap);
        bitmap.truncate(num_pages);
        sandbox_dirty_pages.extend_from_bitslice(bitmap.as_bitslice());
    }

    Ok(sandbox_dirty_pages)

    current_dirty_pages: &[u64],
    most_recent_snapshot: &Option<Arc<SharedMemorySnapshot>>,
) -> Result<u64> {
    if self.sandbox_size() != shared_mem.mem_size() {

Contributor:

hmm, I think we need to check more than just the sandbox size. It has to have the same layout.
What happens if we have a sandbox with 1M heap, 1k io, and 4k stack, and restore on it a sandbox with 1M stack, 4k io, and 1M heap? the main memory size would be the same, but then we would use the wrong Layout, right?

I think this was introduced by the new snapshot API; we should create an issue to fix it.

Contributor:

Ah, I think the fact that we don't support restoring out of branch is the key that avoids this issue for now.

Contributor:

Actually you're right, this case is possible if we're restoring a brand new sandbox, with no previous snapshot, to a given snapshot.

Contributor:

Would it be an option not to support that case, just like we do with out-of-branch snapshots?
Or would things fundamentally break? Like... not being able to restore the first-ever snapshot.

If it's an option, I'm happy for us to create an issue and defer fixing it.

@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch 3 times, most recently from 28b4d42 to 6a5458a on July 18, 2025 19:32
@ludfjig ludfjig marked this pull request as ready for review July 18, 2025 20:39
    mem_regions: Vec<MemoryRegion>,
    n_initial_regions: usize,
    // Size of the (contioous) sandbox mem_regions

@jprendes (Contributor), Jul 18, 2025:

Suggested change:
- // Size of the (contioous) sandbox mem_regions
+ // Size of the (contiguous) sandbox mem_regions

Contributor Author:

Should this be contiguous not continuous?

Contributor:

Fixed the comment :-)

@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch from 6a5458a to 6f17e8b on July 18, 2025 21:54
@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch from 6f17e8b to 06e3edb on July 21, 2025 21:33
@@ -138,9 +214,98 @@ fn sandbox_benchmark(c: &mut Criterion) {
group.finish();
}

// Sandbox creation with different heap sizes

Contributor Author:

I think we should reorient the benchmarks: we should have a set of tests that we perform on sandboxes of differing sizes. For each sized sandbox we should:

  • measure the create time
  • measure the drop time
  • have guest calls with a varying number of parameters that do not reset the sandbox
  • have guest calls with a varying number of parameters that do reset the sandbox
  • repeat the above where there are 1-n host function calls made by the guest function
  • measure the cost of snapshotting when modifying varying proportions of the sandbox memory
  • measure the cost of restoring when modifying varying proportions of the sandbox memory
  • measure the cost of calling with the largest parameters that sized sandbox can support

Contributor:

Filed #722 for this


vm_fd.get_dirty_log(base_pfn, total_size, CLEAR_DIRTY_BIT_FLAG)?;
#[cfg(mshv3)]
vm_fd.get_dirty_log(base_pfn, total_size, MSHV_GPAP_ACCESS_OP_CLEAR as u8)?;

Contributor Author:

Having looked at the perf here, I wonder if we should at least measure what it would look like to track dirty pages in the guest. I think this should be reasonably doable now that the foundation has been laid in the guest.

@syntactically (Member), Jul 23, 2025:

@simongdavies I am already working on (a superset of) this.

@syntactically (Member), Jul 23, 2025:

In the meantime, I wonder if a useful perf optimization would be a handle to disable dirty page tracking, because we expect that this sandbox will only ever be restored to a totally different one. Not sure if that would be useful ATM, but depending on how exactly we do snapshotting for AFD it could make sense.

Member:

> @simongdavies I am already working on (a superset of) this

To clarify: in order to achieve our security properties (i.e. an untrusted guest cannot corrupt the execution of, or learn information from, another guest, even one restored into the same sandbox), we can't trust the guest for dirty page tracking unless the guest is also managing copy-on-write with a hypervisor-enforced read-only underlying store. Therefore, I am looking at making the latter happen.

@@ -885,6 +914,42 @@ impl Hypervisor for HypervLinuxDriver {
self.interrupt_handle.clone()
}

// TODO: Implement getting additional host-mapped dirty pages.

Contributor Author:

What does this TODO mean?

@ludfjig (Contributor), Jul 22, 2025:

I put this here so we won't forget about it once additionally mmaped memory is not read-only, and we potentially want to be able to reset it as well

Member:

I would suggest this comment be made a little bit more explicit about e.g. "TODO: When regions of memory other than the main guest memory can be writable, ensure we don't miss any dirty pages in them" or something like that?

@@ -750,6 +757,47 @@ impl Hypervisor for KVMDriver {
self.interrupt_handle.clone()
}

// TODO: Implement getting additional host-mapped dirty pages.

Contributor Author:

What does this TODO mean?

Contributor:

I put this here so we won't forget about it once additionally mmaped memory is not read-only, and we potentially want to be able to reset it as well

/// Get dirty pages as a bitmap (Vec<u64>).
/// Each bit in a u64 represents a page.
/// This also clears the bitflags, marking the pages as non-dirty.
/// TODO: Implement getting additional host-mapped dirty pages.

Contributor Author:

TODO?

Contributor Author:

If this is about the additional mmapped regions, then I think at the moment these are read-only, so what is needed is metadata to recreate them?

Contributor:

I put this here so we won't forget about it once additionally mmaped memory is not read-only, and we potentially want to be able to reset it as well

///
/// Additionally, writes to the returned slice will not mark pages as dirty.
/// User must call `mark_pages_dirty` manually to mark pages as dirty.
fn as_mut_slice(&mut self) -> &mut [u8] {

Contributor Author:

Is it possible to get rid of this? This is one of the reasons I preferred the abstraction that signal handling gave; I think this has the potential to create difficult-to-find bugs.

Contributor:

No, most other methods on SharedMemory use this to write memory. At least it's private now
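
A toy illustration of the invariant under discussion (this is not Hyperlight's actual SharedMemory; the struct, bitmap layout, and method bodies are assumptions): every higher-level write path pairs the raw slice access with explicit dirty marking.

    const PAGE_SIZE: usize = 4096;

    struct SharedMemory {
        mem: Vec<u8>,
        dirty: Vec<u64>, // one bit per page
    }

    impl SharedMemory {
        // Raw access: writes through this slice are NOT tracked.
        fn as_mut_slice(&mut self) -> &mut [u8] {
            &mut self.mem
        }

        fn mark_pages_dirty(&mut self, offset: usize, len: usize) {
            if len == 0 {
                return;
            }
            let (first, last) = (offset / PAGE_SIZE, (offset + len - 1) / PAGE_SIZE);
            for p in first..=last {
                self.dirty[p / 64] |= 1 << (p % 64);
            }
        }

        // Tracked write path: pair the raw write with explicit dirty marking,
        // preserving the invariant the doc comment above describes.
        fn write(&mut self, offset: usize, src: &[u8]) {
            self.as_mut_slice()[offset..offset + src.len()].copy_from_slice(src);
            self.mark_pages_dirty(offset, src.len());
        }
    }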

Member:

Maybe mark it as unsafe, since there is an internal invariant that it enforces, to make callers read the docs, etc.? Not sure if it is strictly unsafe in the Rust sense, but I suppose if you wrote other code that depended on the snapshot behaviour being what it ought to be, you could perhaps construct something.

Member:

Or, replace it with a version that takes offset/len and does the dirty marking itself... but that's probably not super ideal? Idk.

/// # Safety
/// This function is unsafe because it does not mark the pages as dirty.
/// Only use this if you are certain that the pages do not need to be marked as dirty.
#[instrument(err(Debug), skip_all, parent = Span::current(), level= "Trace")]

Contributor Author:

Under what circumstances is this used?

Contributor:

We need these two for restoring a snapshot. Restoring a sandbox to a snapshotted state should not itself dirty the pages.

// Prints the dirty bitmap in a human-readable format, coloring each page according to its region
// NOTE: Might need to be updated if the memory layout changes
#[allow(dead_code)]
pub(crate) fn print_dirty_bitmap(bitmap: &[u64], layout: &SandboxMemoryLayout) {

Contributor Author:

Any reason we can't have this as a Debug trait impl?

@ludfjig (Contributor), Jul 22, 2025:

We cannot implement Debug on Vec or slice; we would need a wrapper type. Also, this takes a &SandboxMemoryLayout.
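
For what it's worth, the wrapper-type approach could look something like this sketch (names are hypothetical; region-aware coloring would additionally need the &SandboxMemoryLayout, which is omitted here):

    use std::fmt;

    struct DirtyBitmap<'a>(&'a [u64]);

    impl fmt::Debug for DirtyBitmap<'_> {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            // List the index of every dirty page; print_dirty_bitmap would
            // also consult the layout to color pages by region.
            let dirty = self.0.iter().enumerate().flat_map(|(i, w)| {
                let w = *w;
                (0..64usize)
                    .filter(move |&b| w & (1u64 << b) != 0)
                    .map(move |b| i * 64 + b)
            });
            f.debug_list().entries(dirty).finish()
        }
    }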

Labels: kind/enhancement (For PRs adding features, improving functionality, docs, tests, etc.)

Successfully merging this pull request may close these issues:

  • Improve performance when creating "larger" Sandboxes

4 participants